Universal Bounds on the Scaling Behavior of Polar Codes

Ali Goli, S. Hamed Hassani and Rüdiger Urbanke






Abstract: We consider the problem of determining the trade-off
between the rate and the block-length of polar codes for
a given block error probability when we use the successive
cancellation decoder. We take the sum of the Bhattacharyya
parameters as a proxy for the block error probability, and show
that there exists a universal parameter $\mu$ such that for any binary
memoryless symmetric channel $W$ with capacity $I(W)$, reliable
communication requires rates that satisfy $R < I(W) - \alpha N^{-1/\mu}$,
where $\alpha$ is a positive constant and $N$ is the block-length. We
provide lower bounds on $\mu$, namely $\mu \ge 3.553$, and we conjecture
that indeed $\mu = 3.627$, the parameter for the binary erasure
channel.



I. Introduction 

Polar coding schemes provably achieve the capacity of a
wide array of channels including binary memoryless symmetric
(BMS) channels. Let $W$ be a BMS channel with capacity
$I(W)$. In [1], Arikan showed that for any rate $R < I(W)$
the block error probability of the successive cancellation (SC)
decoder is upper bounded by $N^{-1/4}$ for block-length $N$ large
enough. In [2], Arikan and Telatar significantly tightened this
result. They showed that for any rate $R < I(W)$ and any
$\beta < 1/2$ the block error probability is upper bounded by
$2^{-N^{\beta}}$ for $N$ large enough. Later in [3], these bounds were
refined to be dependent on $R$ and it was shown that similar asymptotic
lower bounds are valid when we perform MAP decoding.
Hence, SC and MAP decoders share the same asymptotic
performance in this sense. Such an exponential decay suggests
that error floors should not be a problem for polar codes even
at moderate block lengths (e.g., $N > 10^4$).

Another problem of interest in the area of polar codes is
to determine the trade-off between the rate and the block-length
for a given error probability when we use the SC decoder.
In other words, in order to have
reliable transmission with block error probability at most $\epsilon$,
how does the maximum possible rate $R$ scale in terms of the
block-length $N$? This problem has been previously considered
in [4] and [5], mainly for the family of Binary Erasure Channels
(BEC). In both [4] and [5], the authors provide strong evidence
(both numerically and analytically) that for polar codes with
the SC decoder, reliable communication over the BEC requires
rates $N^{-1/\mu}$ below capacity, where $\mu \approx 3.627$.

In this paper, we provide rigorous lower bounds on the value
of $\mu$, such that for any BMS channel $W$, reliable transmission
(in the sense that the sum of the Bhattacharyya parameters is
small) requires rates at least $N^{-1/\mu}$ below capacity. We begin
by giving the notation and the general problem set-up.

A. Preliminaries

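Let $W : \mathcal{X} \to \mathcal{Y}$ be a BMS channel, with input alphabet
$\mathcal{X} = \{0, 1\}$, output alphabet $\mathcal{Y}$, and the transition probabilities
$\{W(y \mid x) : x \in \mathcal{X}, y \in \mathcal{Y}\}$. We consider the following three
parameters for the channel $W$,

$$H(W) = \sum_{y \in \mathcal{Y}} W(y \mid 1) \log \frac{W(y \mid 1) + W(y \mid 0)}{W(y \mid 1)}, \tag{1}$$

$$Z(W) = \sum_{y \in \mathcal{Y}} \sqrt{W(y \mid 0) W(y \mid 1)}, \tag{2}$$

$$E(W) = \frac{1}{2} \sum_{y \in \mathcal{Y}} W(y \mid 1) \, e^{-\frac{1}{2} \left( \ln \frac{W(y \mid 1)}{W(y \mid 0)} + \left| \ln \frac{W(y \mid 1)}{W(y \mid 0)} \right| \right)}. \tag{3}$$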



The parameter H(W) is equal to the entropy of the output 
of W given its input when we assume uniform distribution 
on the inputs, i.e., H(W) = H(X | Y). Hence, we call the 
parameter H(W) the entropy of the channel W . Also note 
that the capacity of W, which we denote by I(W), is given 
by I(W) = 1 - H(W). The parameter Z(W) is called the 
Bhattacharyya parameter of W and E(W) is called the error 
probability of $W$. It can be shown that $E(W)$ is equal to
the error probability in estimating the channel input $x$ on the
basis of the channel output $y$ via the maximum-likelihood
decoding of $W(y \mid x)$ (with the further assumption that the input
has uniform distribution). The following
relations hold between these parameters (see, e.g., [1] and
[6, Chapter 4]):

$$0 \le 2E(W) \le H(W) \le Z(W) \le 1, \tag{4}$$

$$H(W) \le h_2(E(W)), \tag{5}$$

$$Z(W) \le \sqrt{1 - (1 - H(W))^2}, \tag{6}$$

where $h_2(\cdot)$ denotes the binary entropy function.
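As a quick sanity check of definitions (1)-(3) and relations (4)-(6), the following minimal sketch evaluates the three parameters for a binary symmetric channel. The crossover probability 0.11 and the helper names `channel_parameters` and `h2` are illustrative assumptions, not taken from the paper.

```python
import math

def channel_parameters(w0, w1):
    """Evaluate (1)-(3) for a channel given by two lists of transition
    probabilities, w0[y] = W(y|0) and w1[y] = W(y|1), all assumed > 0."""
    H = sum(b * math.log2((a + b) / b) for a, b in zip(w0, w1))
    Z = sum(math.sqrt(a * b) for a, b in zip(w0, w1))
    E = 0.5 * sum(
        b * math.exp(-0.5 * (math.log(b / a) + abs(math.log(b / a))))
        for a, b in zip(w0, w1)
    )
    return H, Z, E

def h2(x):
    """Binary entropy function in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

# Illustrative channel: BSC with crossover probability p (an assumed value).
p = 0.11
H, Z, E = channel_parameters([1 - p, p], [p, 1 - p])
print(f"H = {H:.4f}, Z = {Z:.4f}, E = {E:.4f}")
print("(4) holds:", 0 <= 2 * E <= H <= Z <= 1)
print("(5) holds:", H <= h2(E) + 1e-12)
print("(6) holds:", Z <= math.sqrt(1 - (1 - H) ** 2))
```

For the BSC the sketch reproduces the familiar closed forms, namely an entropy of $h_2(p)$ and an error probability of $p$, so (5) holds with equality up to rounding.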



B. Channel transform 

Let $\mathcal{W}$ denote the set of all the BMS channels and consider
a transform $W \to (W^-, W^+)$ that maps $\mathcal{W}$ to $\mathcal{W}^2$ in the
following manner. Having the channel $W : \{0, 1\} \to \mathcal{Y}$, the
channels $W^- : \{0, 1\} \to \mathcal{Y}^2$ and $W^+ : \{0, 1\} \to \{0, 1\} \times \mathcal{Y}^2$
are defined as

$$W^-(y_1, y_2 \mid x_1) = \sum_{x_2 \in \{0, 1\}} \frac{1}{2} W(y_1 \mid x_1 \oplus x_2) W(y_2 \mid x_2), \tag{7}$$

$$W^+(y_1, y_2, x_1 \mid x_2) = \frac{1}{2} W(y_1 \mid x_1 \oplus x_2) W(y_2 \mid x_2). \tag{8}$$

A direct consequence of the chain rule of entropy yields

$$\frac{H(W^+) + H(W^-)}{2} = H(W). \tag{9}$$

One can also show that

$$H(W) \le H(W^-) \le 1 - (1 - H(W))^2, \tag{10}$$

$$H(W)^2 \le H(W^+) \le H(W). \tag{11}$$
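The transform (7)-(8) can be applied directly to a finite-output channel. The sketch below is only an illustration under assumptions: the channel is stored as a dictionary keyed by `(y, x)`, the example channel is a BSC with an arbitrarily chosen crossover probability, and the function names are hypothetical. It then checks (9), (10) and (11) numerically.

```python
import math
from itertools import product

def entropy(W):
    """H(X|Y) for uniform input; W maps (y, x) -> W(y|x), x in {0, 1}.
    For a BMS channel this coincides with H(W) in (1)."""
    total = 0.0
    for y in {y for (y, _) in W}:
        a, b = W[(y, 0)], W[(y, 1)]
        for prob in (a, b):
            if prob > 0:
                total += 0.5 * prob * math.log2((a + b) / prob)
    return total

def minus(W, ys):
    """W^- from (7): input x1, output (y1, y2)."""
    return {((y1, y2), x1): sum(0.5 * W[(y1, x1 ^ x2)] * W[(y2, x2)] for x2 in (0, 1))
            for y1, y2, x1 in product(ys, ys, (0, 1))}

def plus(W, ys):
    """W^+ from (8): input x2, output (y1, y2, x1)."""
    return {((y1, y2, x1), x2): 0.5 * W[(y1, x1 ^ x2)] * W[(y2, x2)]
            for y1, y2, x1, x2 in product(ys, ys, (0, 1), (0, 1))}

# Illustrative channel: BSC(p) with an assumed crossover probability.
p = 0.11
bsc = {(0, 0): 1 - p, (1, 0): p, (0, 1): p, (1, 1): 1 - p}
H = entropy(bsc)
H_minus = entropy(minus(bsc, (0, 1)))
H_plus = entropy(plus(bsc, (0, 1)))

print("(9):  ", abs((H_plus + H_minus) / 2 - H) < 1e-12)
print("(10): ", H <= H_minus <= 1 - (1 - H) ** 2)
print("(11): ", H ** 2 <= H_plus <= H)
```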



C. Polarization process 

Let $\{B_n, n \ge 1\}$ be a sequence of iid Bernoulli($\frac{1}{2}$) random
variables. Denote by $(\mathcal{F}, \Omega, \mathbb{P})$ the probability space generated
by this sequence and let $(\mathcal{F}_n, \Omega_n, \mathbb{P}_n)$ be the probability space
generated by $(B_1, \cdots, B_n)$. For a BMS channel $W$, define a
random sequence of channels $W_n$, $n \in \mathbb{N} = \{0, 1, 2, \cdots\}$, as
$W_0 = W$ and

$$W_n = \begin{cases} W_{n-1}^+, & \text{if } B_n = 1, \\ W_{n-1}^-, & \text{if } B_n = 0, \end{cases} \tag{12}$$

where the channels on the right side are given by the transform
$W_{n-1} \to (W_{n-1}^-, W_{n-1}^+)$. Let us also define the random processes
$\{H_n\}_{n \in \mathbb{N}}$, $\{I_n\}_{n \in \mathbb{N}}$ and $\{Z_n\}_{n \in \mathbb{N}}$ as $H_n = H(W_n)$,
$I_n = I(W_n) = 1 - H(W_n)$ and $Z_n = Z(W_n)$. From (9)
one can easily observe that $H_n$ (and $I_n$) is a martingale
with $\mathbb{E}[H_n] = H(W)$. It is further known from [1] that
the processes $H_n$ and $Z_n$ converge almost surely to limit
random variables $H_\infty$ and $Z_\infty$, and furthermore, these limit
random variables take their values in the set $\{0, 1\}$ with
$\Pr(H_\infty = 1) = \Pr(Z_\infty = 1) = H(W)$.
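For the BEC the process $H_n$ can be enumerated exactly, because the transform maps an erasure probability $h$ to $h^2$ (plus) and $1 - (1 - h)^2$ (minus), as stated in Section II-A. The following sketch lists all $2^n$ equally likely values of $H_n$ and illustrates the martingale property and the polarization. The starting value 0.5, the threshold `delta` and the function names are assumptions made for illustration.

```python
def bec_entropies(h, n):
    """All 2**n equally likely values of H_n for BEC(h)."""
    values = [h]
    for _ in range(n):
        # B = 0 gives 1 - (1 - v)**2 (minus branch), B = 1 gives v**2 (plus branch).
        values = [u for v in values for u in (1 - (1 - v) ** 2, v * v)]
    return values

def summary(h, n, delta=0.01):
    """Return E[H_n], E[H_n (1 - H_n)] and the fraction of values in (delta, 1 - delta)."""
    values = bec_entropies(h, n)
    mean = sum(values) / len(values)
    spread = sum(v * (1 - v) for v in values) / len(values)
    unpolarized = sum(delta < v < 1 - delta for v in values) / len(values)
    return mean, spread, unpolarized

for n in (2, 6, 10, 14):
    mean, spread, unpolarized = summary(0.5, n)
    print(f"n={n:2d}  E[H_n]={mean:.4f}  E[H_n(1-H_n)]={spread:.4f}  "
          f"unpolarized fraction={unpolarized:.3f}")
```

The mean stays at the starting value for every $n$, while $\mathbb{E}[H_n(1 - H_n)]$ shrinks. The rate of that decay is the quantity bounded in Section II.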

D. Polar codes 

Given the rate $R < I(W)$, polar coding is based on
choosing a set of $2^n R$ rows of the matrix $G_n = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{\otimes n}$
to form a $2^n R \times 2^n$ matrix which is used as the generator
matrix in the encoding procedure¹. The way this set is chosen
is dependent on the channel $W$ and is briefly explained as
follows: At time $n \in \mathbb{N}$, consider a specific realization of the
sequence $(B_1, \cdots, B_n)$, and denote it by $(b_1, \cdots, b_n)$. The
random variable $W_n$ outputs a BMS channel, according to the
procedure (12), which we can naturally denote by $W^{(b_1, \cdots, b_n)}$.
Let us now identify a sequence $(b_1, \cdots, b_n)$ by an integer
$i$ in the set $\{1, \cdots, N\}$ such that the binary expansion of
$i - 1$ is equal to the sequence $(b_1, \cdots, b_n)$, with $b_1$ as the
least significant bit. As an example for $n = 3$, we identify
$(b_1, b_2, b_3) = (0, 0, 1)$ with 5 and $(b_1, b_2, b_3) = (1, 0, 0)$ with
2. To simplify notation, we use $W_N^{(i)}$ to denote $W^{(b_1, \cdots, b_n)}$.
Given the rate $R$, the indices of the matrix $G_n$ are chosen as
follows: Choose a subset of size $NR$ from the set of channels
$\{W_N^{(i)}\}_{1 \le i \le N}$ that have the least possible error probability
(given in (3)) and choose the rows of $G_n$ with the same indices
as these channels. E.g., if the channel $W_N^{(j)}$ is chosen, then
the $j$-th row of $G_n$ is selected. In the following, given $N$, we
call the set of indices of $NR$ channels with the least error
probability the set of good indices and denote it by $\mathcal{I}_{N,R}$.

¹ There are extensions of polar codes given in [7] which use different kinds
of matrices.

It is proved in [1] that the block error probability of
such a polar coding scheme under SC decoding, denoted by
$P_e(N, R)$, is bounded from both sides by²

$$\max_{i \in \mathcal{I}_{N,R}} E(W_N^{(i)}) \le P_e(N, R) \le \sum_{i \in \mathcal{I}_{N,R}} E(W_N^{(i)}). \tag{13}$$

E. Main results



Consider a BMS channel $W$ and let us assume that a polar
code with block-error probability at most a given value $\epsilon > 0$
is required. One way to accomplish this is to ensure that
the right side of (13) is less than $\epsilon$. However, this is only a
sufficient condition that might not be necessary. Hence, we call
the right side of (13) the strong reliability condition. Based on
this measure of the block-error probability, we provide bounds
on how the rate $R$ scales in terms of the block-length $N$.
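For the BEC, the right side of (13) can be evaluated exactly, since every sub-channel is again a BEC and its error probability is half its erasure probability (see (17)). The sketch below follows the index convention of Section I-D, with $b_1$ as the least significant bit of $i - 1$, selects the good indices, and returns the sum used in the strong reliability condition. The function names and the chosen parameters are illustrative assumptions.

```python
def bec_error_probs(h, n):
    """E(W_N^{(i)}) for i = 1..N over BEC(h), with N = 2**n.
    Bit b_k of i - 1 (b_1 least significant) selects the branch at step k:
    b = 1 maps v -> v**2 (plus), b = 0 maps v -> 1 - (1 - v)**2 (minus)."""
    probs = []
    for i in range(1, 2 ** n + 1):
        v = h
        for k in range(n):                      # k = 0 corresponds to b_1
            b = ((i - 1) >> k) & 1
            v = v * v if b == 1 else 1 - (1 - v) ** 2
        probs.append(v / 2)                     # E = H / 2 for the BEC
    return probs

def strong_reliability_sum(h, n, rate):
    """Good index set (1-based) of size N * rate and the right side of (13)."""
    probs = bec_error_probs(h, n)
    size = int(rate * 2 ** n)
    good = sorted(range(1, 2 ** n + 1), key=lambda i: probs[i - 1])[:size]
    return sum(probs[i - 1] for i in good), good

# Illustrative values: BEC(0.5), so I(W) = 0.5, at block-length N = 1024.
for rate in (0.30, 0.40, 0.45):
    total, _ = strong_reliability_sum(0.5, 10, rate)
    print(f"R = {rate:.2f}: sum of error probabilities over good indices = {total:.3e}")
```

Pushing the rate closer to capacity at a fixed block-length makes the sum grow quickly, which is the trade-off that Theorem 1 quantifies.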

Theorem 1: For any BMS channel $W$ with capacity
$I(W) \in (0, 1)$, there exist constants $\epsilon, \alpha > 0$, which depend
only on $I(W)$, such that

$$\sum_{i \in \mathcal{I}_{N,R}} E(W_N^{(i)}) \le \epsilon \tag{14}$$

implies

$$R < I(W) - \frac{\alpha}{N^{1/\mu}}, \tag{15}$$

where $\mu$ is a universal parameter lower bounded by 3.553. ■
A few comments are in order: 

1) As we will see in the sequel, we can obtain an increasing
sequence of lower bounds, call this sequence $\{\mu_m\}_{m \in \mathbb{N}}$, for
the universal parameter $\mu$. For each $m$, in order to show the
validity of the lower bound we need to verify the concavity
of a certain polynomial (defined in (20)) in $[0, 1]$. For small
values of $m$ concavity can be proved directly using pen and
paper. For larger values of $m$ we can automate this process:
each polynomial has rational coefficients. Hence also its
second derivative has rational coefficients. To show concavity
it suffices to show that there are no roots of this second
derivative in $[0, 1]$. This task can be accomplished exactly
by computing so-called Sturm chains (see Sturm's Theorem
[8]). Computing Sturm chains is equivalent to running Euclid's
algorithm starting with the second and third derivative of
the original polynomial. The lower bound for $\mu$ stated in
Theorem 1 is the one corresponding to $m = 8$, an arbitrary
choice. If we increase $m$ we get, e.g., $\mu_{16} = 3.614$. We
conjecture that the sequence $\mu_m$ converges to $\mu = 3.627$, the
parameter for the BEC.

2) Let $\epsilon, \alpha, \mu$ be as in Theorem 1. If we require the
block-error probability to be less than $\epsilon$ (in the sense that the
condition (14) is fulfilled), then the block-length should
be at least

$$N > \left( \frac{\alpha}{I(W) - R} \right)^{\mu}. \tag{16}$$



² Note here that by (3) the error probability of a BMS channel is less than
its Bhattacharyya value. Hence, the right side of (13) is a better upper bound
for the block error probability than the sum of the Bhattacharyya values.

3) It is well known that the value of $\mu$ for the random
linear ensemble is $\mu = 2$, which is the optimal value since the
variations of the channel itself require $\mu \ge 2$. Thus, given a
block-length $N$, reliable transmission by polar codes requires
a larger gap to the channel capacity than the optimal value.
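To make comments 2) and 3) concrete, the following sketch evaluates the right side of (16) and compares how the bound grows when the gap to capacity is halved, for $\mu = 3.553$ and for the optimal exponent $\mu = 2$. The constant $\alpha$ is not given numerically in the paper, so `alpha=1.0` is a placeholder. The growth factor shown does not depend on it.

```python
def min_block_length(gap, mu, alpha=1.0):
    """Right side of (16): (alpha / gap)**mu, where gap = I(W) - R.
    alpha = 1.0 is a placeholder value, not a constant from the paper."""
    return (alpha / gap) ** mu

for mu in (2.0, 3.553):
    # Ratio of the bound at gap 0.01 to the bound at gap 0.02 equals 2**mu.
    growth = min_block_length(0.01, mu) / min_block_length(0.02, mu)
    print(f"mu = {mu}: halving the gap multiplies the bound by about {growth:.2f}")
```

Halving the gap multiplies the bound by $2^{\mu}$, that is, by 4 for $\mu = 2$ and by more than 11 for $\mu = 3.553$.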

The rest of the paper is devoted to proving Theorem 1. In
Section II, we provide universal lower bounds on how fast
the process $H_n$ converges to its limit $H_\infty$. We then use these
bounds to prove Theorem 1 in Section III. Finally, Section IV
concludes the paper by stating the related open questions.

II. Universal lower bounds on the speed of polarization

Consider a channel $W$ with its entropy process $H_n = H(W_n)$.
Since the bounded process $H_n$ converges almost
surely to a $\{0, 1\}$-valued random variable, we have
$\lim_{n \to \infty} \mathbb{E}[H_n(1 - H_n)] = 0$. In this section, we provide
universal lower bounds on the speed with which the quantity
$\mathbb{E}[H_n(1 - H_n)]$ decays to 0. We first derive such lower bounds
for the family of Binary Erasure Channels (BEC) and then
extend them to other BMS channels.

A. Binary erasure channel 

Consider a binary erasure channel with erasure probability
$h \in [0, 1]$, which we denote by BEC($h$). One can show that
(see [6, Chapter 4]) for such a channel we have

$$H(\mathrm{BEC}(h)) = Z(\mathrm{BEC}(h)) = 2E(\mathrm{BEC}(h)) = h. \tag{17}$$

Furthermore, we have

$$(\mathrm{BEC}(h))^+ = \mathrm{BEC}(h^2),$$

$$(\mathrm{BEC}(h))^- = \mathrm{BEC}(1 - (1 - h)^2),$$

both proved in [1]. Hence, the processes $H_n$ and $Z_n$ for
BEC($h$) are equal and have a simple closed form expression
as follows: Let $H_0 = h$ and³

$$H_n = \begin{cases} H_{n-1}^2, & \text{if } B_n = 1, \\ 1 - (1 - H_{n-1})^2, & \text{if } B_n = 0. \end{cases} \tag{18}$$

Let us now define the sequence of functions $\{f_n(h)\}_{n \in \mathbb{N}}$ as
$f_n : [0, 1] \to [0, 1]$ and for $h \in [0, 1]$,

$$f_n(h) = \mathbb{E}[H_n(1 - H_n)]. \tag{19}$$

Here, note that for $h \in [0, 1]$ the value of $f_n(h)$ is a
deterministic value that is dependent on the process $H_n$ with
the starting value $H_0 = h$. By using the recursive relation
(18), one can easily deduce that

$$f_0(h) = h(1 - h), \qquad f_n(h) = \frac{f_{n-1}(h^2) + f_{n-1}(1 - (1 - h)^2)}{2}. \tag{20}$$

Let us also define a sequence of numbers $\{a_m\}_{m \in \mathbb{N}}$ as

$$a_m = \inf_{h \in [0, 1]} \frac{f_{m+1}(h)}{f_m(h)}. \tag{21}$$

³ Note that to simplify notation we have dropped the dependency of $H_n$ on
its starting value $H_0 = h$.

    a_m      0.75     0.7897   0.8075   0.8190   0.8228
    mu_m     2.409    2.935    3.241    3.471    3.553

                          TABLE I



Remark 2: One can compute the value of $a_m$ by finding
the extreme points of the function $f_{m+1}/f_m$ (i.e., finding the
roots of the polynomial $g_m = f'_{m+1} f_m - f_{m+1} f'_m$) and
checking which one gives the global minimum. Again, for
small values, e.g., $m = 0, 1$, pen and paper suffice. For
higher values of $m$ we can again automate the process: all
these polynomials have rational coefficients and therefore it is
possible to determine the number of real roots exactly and to
determine their value to any desired precision (by computing
Sturm chains as mentioned earlier). Hence, we can find the
value of $a_m$ to any desired precision. Table I contains the
numerical value of $a_m$ up to precision $10^{-4}$ for $m \le 8$. As
the table shows, the values $a_m$ are increasing (see Lemma 3),
and we conjecture that they converge to $2^{-1/3.62713} = 0.8260$,
the corresponding value for the channel BEC.
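A rough numerical alternative to the exact root-finding procedure of Remark 2 is to evaluate the recursion (20) on a grid and take the smallest ratio. This is a sketch only: a grid search gives an approximation from above, not the exact infimum in (21), and the grid size and function names are assumptions.

```python
def f(n, h):
    """f_n(h) from the recursion (20)."""
    if n == 0:
        return h * (1 - h)
    return 0.5 * (f(n - 1, h * h) + f(n - 1, 1 - (1 - h) ** 2))

def a_grid(m, points=2000):
    """Grid approximation of a_m in (21) over the open interval (0, 1).
    The endpoints are skipped because f_m vanishes there."""
    grid = [k / points for k in range(1, points)]
    return min(f(m + 1, h) / f(m, h) for h in grid)

for m in range(4):
    print(f"m = {m}: a_m is approximately {a_grid(m):.4f}")
```

For $m = 0$ the ratio is $f_1(h)/f_0(h) = h^2 - h + 1$, whose minimum $3/4$ at $h = 1/2$ matches the first entry of Table I.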
We now show that each of the values $a_m$ is a lower bound on
the speed of decay of the sequence $f_n$.

Lemma 3: Fix $m \in \mathbb{N}$. For all $n \ge m$ and $h \in [0, 1]$, we
have

$$(a_m)^{n-m} f_m(h) \le f_n(h). \tag{22}$$

Furthermore, the sequence $a_m$ is an increasing sequence.

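Proof: The proof goes by induction on $n - m$. For $n - m = 0$
the result is trivial. Now, assume that the relation (22) holds
for $n - m = k$, i.e., for $h \in [0, 1]$ we have

$$(a_m)^k f_m(h) \le f_{m+k}(h). \tag{23}$$

We show that (22) is indeed true for $k + 1$ and $h \in [0, 1]$. We
have

$$\begin{aligned}
f_{m+k+1}(h) &\overset{(a)}{=} \frac{f_{m+k}(h^2) + f_{m+k}(1 - (1 - h)^2)}{2} \\
&\overset{(b)}{\ge} \frac{(a_m)^k f_m(h^2) + (a_m)^k f_m(1 - (1 - h)^2)}{2} \\
&= (a_m)^k f_{m+1}(h) \\
&= (a_m)^k \frac{f_{m+1}(h)}{f_m(h)} f_m(h) \\
&\ge (a_m)^k \left[ \inf_{h \in [0, 1]} \frac{f_{m+1}(h)}{f_m(h)} \right] f_m(h) \\
&= (a_m)^{k+1} f_m(h).
\end{aligned}$$

Here, (a) follows from (20) and (b) follows from the induction
hypothesis (23), and hence (22) is proved via induction.
The monotonicity of $a_m$ follows in the same way: by (20) and (21),
$f_{m+2}(h) \ge a_m f_{m+1}(h)$ for all $h \in [0, 1]$, hence $a_{m+1} \ge a_m$. ■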
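Inequality (22) can be spot-checked numerically for $m = 0$, where $a_0 = 3/4$ is known in closed form (the ratio $f_1/f_0$ equals $h^2 - h + 1$). The sketch below tests the bound on a grid. The grid size, the floating-point tolerance and the function names are assumptions, and a grid check is of course not a proof.

```python
def f(n, h):
    """f_n(h) from the recursion (20)."""
    if n == 0:
        return h * (1 - h)
    return 0.5 * (f(n - 1, h * h) + f(n - 1, 1 - (1 - h) ** 2))

A0 = 0.75  # a_0 = min over h of (h**2 - h + 1), attained at h = 1/2

def lemma3_holds(n, points=200, tol=1e-12):
    """Check (a_0)**n * f_0(h) <= f_n(h) on a uniform grid of [0, 1]."""
    grid = [k / points for k in range(points + 1)]
    return all(A0 ** n * f(0, h) <= f(n, h) + tol for h in grid)

for n in range(9):
    print(f"n = {n}: bound (22) with m = 0 holds on the grid: {lemma3_holds(n)}")
```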

B. BMS Channels 

For a BMS channel $W$, there is no simple 1-dimensional
recursion for the process $H_n$ as for the BEC. However, by using
(10) and (11), one can give bounds on how $H_n$ evolves. In
this section, we use the functions $\{f_n\}_{n \in \mathbb{N}}$ defined in (20) to
provide universal lower bounds on the quantity $\mathbb{E}[H_n(1 - H_n)]$.
We start by introducing one further technical condition given
as follows.

Definition 4: We call an integer $m \in \mathbb{N}$ suitable if the
function $f_m(h)$, defined in (20), is concave on $[0, 1]$.

Remark 5: For small values of $m$, i.e., $m \le 2$, it is easy to
verify by hand that the function $f_m$ is concave. As discussed
previously, for larger values of $m$ we can use Sturm's theorem
[8] and a computer algebra system to verify this claim. Note
that the polynomials $f_m$ have integer coefficients. Hence, all
the required computations can be done exactly. Unfortunately,
the degree of $f_m$ is $2^{m+1}$. We have checked up to $m = 8$ that
$f_m$ is concave and we conjecture that in fact this is true for
all $m \in \mathbb{N}$.
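The exact concavity test described in Remark 5 can be sketched with rational arithmetic from the Python standard library instead of a computer algebra system. The code below builds $f_m$ as an exact polynomial via (20), forms a Sturm chain for its second derivative, and counts roots in $(0, 1]$. All function names are hypothetical, the polynomial representation (coefficient lists, lowest degree first) is a choice made for this illustration, and only small $m$ are practical this way.

```python
from fractions import Fraction

def trim(p):
    """Drop trailing zero coefficients."""
    while p and p[-1] == 0:
        p = p[:-1]
    return p

def add(p, q):
    n = max(len(p), len(q))
    return trim([(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
                 for i in range(n)])

def mul(p, q):
    if not p or not q:
        return []
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return trim(r)

def compose(p, q):
    """p(q(h)) by Horner's rule."""
    r = []
    for c in reversed(p):
        r = add(mul(r, q), [c])
    return r

def deriv(p):
    return trim([k * c for k, c in enumerate(p)][1:])

def rem(p, q):
    """Remainder of the polynomial division of p by q (q nonzero)."""
    p = list(p)
    while p and len(p) >= len(q):
        c, s = p[-1] / q[-1], len(p) - len(q)
        for i, b in enumerate(q):
            p[s + i] -= c * b
        p = trim(p)
    return p

def eval_poly(p, x):
    v = Fraction(0)
    for c in reversed(p):
        v = v * x + c
    return v

def sign_changes(chain, x):
    vals = [v for v in (eval_poly(p, x) for p in chain) if v != 0]
    return sum(1 for a, b in zip(vals, vals[1:]) if (a > 0) != (b > 0))

def roots_in(p, lo, hi):
    """Distinct real roots of p in (lo, hi] by Sturm's theorem,
    assuming p(lo) != 0 and p(hi) != 0."""
    chain = [p, deriv(p)]
    while chain[-1]:
        chain.append([-c for c in rem(chain[-2], chain[-1])])
    chain.pop()  # drop the trailing zero polynomial
    return sign_changes(chain, lo) - sign_changes(chain, hi)

def f_poly(m):
    """f_m from (20) as an exact polynomial in h."""
    h = [Fraction(0), Fraction(1)]
    f = [Fraction(0), Fraction(1), Fraction(-1)]            # h - h^2
    plus = mul(h, h)                                        # h^2
    minus = add([2 * c for c in h], [-c for c in plus])     # 1 - (1 - h)^2
    for _ in range(m):
        f = [c / 2 for c in add(compose(f, plus), compose(f, minus))]
    return f

def is_suitable(m):
    """True if f_m'' is negative at 0 and has no root in (0, 1]."""
    d2 = deriv(deriv(f_poly(m)))
    return eval_poly(d2, Fraction(0)) < 0 and roots_in(d2, Fraction(0), Fraction(1)) == 0

for m in range(4):
    print(f"m = {m}: degree {len(f_poly(m)) - 1}, suitable: {is_suitable(m)}")
```

Because the degree of $f_m$ doubles with each step, this naive implementation becomes slow quickly. The paper's check up to $m = 8$ relies on a computer algebra system.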
In the rest of this section, we show that for any BMS channel
$W$, the value of $a_m$ is a lower bound on the speed of decay
of $H_n$ provided that $m$ is a suitable integer.

Lemma 6: Let $m \in \mathbb{N}$ be a suitable integer and $W$ a BMS
channel. We have for $n \ge m$

$$\mathbb{E}[H_n(1 - H_n)] \ge (a_m)^{n-m} f_m(H(W)), \tag{24}$$

where $a_m$ is given in (21).

Proof: We use induction on $n - m$. For $n - m = 0$ there
is nothing to prove. Now, assume that the result of the lemma
is correct for $n - m = k$. Hence, for any BMS channel $W$
with $H_n = H(W_n)$ we have

$$\mathbb{E}[H_{m+k}(1 - H_{m+k})] \ge (a_m)^k f_m(H(W)). \tag{25}$$

We now prove the lemma for $n - m = k + 1$. For the
BMS channel $W$, let us recall that the transform $W \to (W^-, W^+)$
yields two channels $W^-$ and $W^+$ such that
the relation (9) holds. Define the process $\{(W^-)_n, n \in \mathbb{N}\}$
as the channel process that starts with $(W^-)_0 = W^-$ and
evolves as in (12), and similarly define $\{(W^+)_n, n \in \mathbb{N}\}$
with $(W^+)_0 = W^+$. Let us also define the two processes
$H_n^- = H((W^-)_n)$ and $H_n^+ = H((W^+)_n)$. We have

$$\begin{aligned}
\mathbb{E}[H_{m+k+1}(1 - H_{m+k+1})]
&\overset{(a)}{=} \frac{\mathbb{E}[H_{m+k}^-(1 - H_{m+k}^-)] + \mathbb{E}[H_{m+k}^+(1 - H_{m+k}^+)]}{2} \\
&\overset{(b)}{\ge} (a_m)^k \, \frac{f_m(H(W^-)) + f_m(H(W^+))}{2} \\
&\overset{(c)}{\ge} (a_m)^k \, \frac{f_m(1 - (1 - H(W))^2) + f_m(H(W)^2)}{2} \\
&\overset{(d)}{=} (a_m)^k f_{m+1}(H(W)) \\
&= (a_m)^k \, \frac{f_{m+1}(H(W))}{f_m(H(W))} \, f_m(H(W)) \\
&\ge (a_m)^k \left[ \inf_{h \in [0, 1]} \frac{f_{m+1}(h)}{f_m(h)} \right] f_m(H(W)) \\
&\overset{(e)}{=} (a_m)^{k+1} f_m(H(W)).
\end{aligned}$$

In the above chain of inequalities, relation (a) follows from
the fact that $W_m$ has $2^m$ possible outputs among which
half of them are branched out from $W^+$ and the other half
are branched out from $W^-$. Relation (b) follows from the
induction hypothesis given in (25). Relation (c) follows from
(10), (11) and the fact that the function $f_m$ is concave. More
precisely, since $f_m$ is concave on $[0, 1]$, we have the following
inequality for any sequence of numbers $0 \le x' \le x \le y \le y' \le 1$
that satisfy $\frac{x + y}{2} = \frac{x' + y'}{2}$:

$$\frac{f_m(x') + f_m(y')}{2} \le \frac{f_m(x) + f_m(y)}{2}. \tag{26}$$

In particular, we set $x' = H(W)^2$, $x = H(W^+)$, $y = H(W^-)$,
$y' = 1 - (1 - H(W))^2$ and we know from (9), (10) and
(11) that $0 \le x' \le x \le y \le y' \le 1$ and $x + y = x' + y'$. Hence, by (26) we obtain
(c). Relation (d) follows from the recursive definition of $f_m$
given in (20). Finally, relation (e) follows from the definition
of $a_m$ given in (21). ■
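Step (c) can be illustrated on a concrete non-erasure channel. The sketch below uses a binary symmetric channel BSC($p$), for which $W^-$ is equivalent to BSC($2p(1-p)$) (a standard fact, used here as an assumption external to the paper) and $H(W^+)$ then follows from (9). It checks the ordering $x' \le x \le y \le y'$ and inequality (26) for $m \le 2$. The function names and the chosen values of $p$ are illustrative.

```python
import math

def h2(x):
    """Binary entropy function in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def f(n, h):
    """f_n(h) from the recursion (20)."""
    if n == 0:
        return h * (1 - h)
    return 0.5 * (f(n - 1, h * h) + f(n - 1, 1 - (1 - h) ** 2))

def check_step_c(p, m):
    """Return (ordering holds, left side of (26), right side of (26)) for BSC(p)."""
    H = h2(p)
    y = h2(2 * p * (1 - p))          # H(W^-): W^- behaves like BSC(2p(1-p))
    x = 2 * H - y                    # H(W^+), from (9)
    x_prime, y_prime = H * H, 1 - (1 - H) ** 2
    ordered = 0 <= x_prime <= x <= y <= y_prime <= 1
    lhs = 0.5 * (f(m, x_prime) + f(m, y_prime))
    rhs = 0.5 * (f(m, x) + f(m, y))
    return ordered, lhs, rhs

for p in (0.05, 0.11, 0.3):
    for m in range(3):
        ordered, lhs, rhs = check_step_c(p, m)
        print(f"p = {p}, m = {m}: ordered = {ordered}, {lhs:.4f} <= {rhs:.4f}")
```

The left side is what the BEC with the same entropy would produce, so the output shows the BEC attaining the smaller value, in line with its role as the extremal channel in this argument.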

III. Proof of Theorem 1 

To fit the bounds of Section II into the framework of
Theorem 1, let us first introduce the sequence $\{\mu_m\}_{m \in \mathbb{N}}$ as

$$\mu_m = -\frac{1}{\log a_m}, \tag{27}$$

where $a_m$ is defined in (21). In the last section, we proved that
for a suitable $m$, the speed with which the quantity
$\mathbb{E}[H_n(1 - H_n)]$ decays is lower bounded by $a_m = 2^{-1/\mu_m}$, i.e., for $n \ge m$
we have $\mathbb{E}[H_n(1 - H_n)] \ge 2^{-\frac{n - m}{\mu_m}} f_m(H(W))$. To relate the
strong reliability condition in (14) to the rate bound in (15),
we need the following lemma.

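Lemma 7: Consider a BMS channel $W$ and assume that
there exist positive real numbers $\gamma$, $\theta$ and $m \in \mathbb{N}$ such that
$\mathbb{E}[H_n(1 - H_n)] \ge \gamma 2^{-n\theta}$ for $n \ge m$. Let $\alpha, \beta \ge 0$ be such
that $2\alpha + \beta = \gamma$. Then for $n \ge m$ we have

$$\Pr(H_n \le \alpha 2^{-n\theta}) \le I(W) - \beta 2^{-n\theta}. \tag{28}$$

Proof: The proof is by contradiction. Let us assume the
contrary, i.e., we assume there exists $n \ge m$ such that

$$\Pr(H_n \le \alpha 2^{-n\theta}) > I(W) - \beta 2^{-n\theta}. \tag{29}$$

In the following, we show that with such an assumption we
reach a contradiction. We have

$$\begin{aligned}
\mathbb{E}[H_n(1 - H_n)]
&= \mathbb{E}[H_n(1 - H_n) \mid H_n \le \alpha 2^{-n\theta}] \Pr(H_n \le \alpha 2^{-n\theta}) \\
&\quad + \mathbb{E}[H_n(1 - H_n) \mid H_n > \alpha 2^{-n\theta}] \Pr(H_n > \alpha 2^{-n\theta}).
\end{aligned} \tag{30}$$

It is now easy to see that

$$\mathbb{E}[H_n(1 - H_n) \mid H_n \le \alpha 2^{-n\theta}] \le \alpha 2^{-n\theta},$$

and since $\mathbb{E}[H_n(1 - H_n)] \ge \gamma 2^{-n\theta}$, by using (30) we get

$$\mathbb{E}[H_n(1 - H_n) \mid H_n > \alpha 2^{-n\theta}] \Pr(H_n > \alpha 2^{-n\theta}) \ge 2^{-n\theta}(\gamma - \alpha). \tag{31}$$

We can further write

$$\begin{aligned}
\mathbb{E}[1 - H_n]
&= \mathbb{E}[1 - H_n \mid H_n \le \alpha 2^{-n\theta}] \Pr(H_n \le \alpha 2^{-n\theta}) \\
&\quad + \mathbb{E}[1 - H_n \mid H_n > \alpha 2^{-n\theta}] \Pr(H_n > \alpha 2^{-n\theta}),
\end{aligned} \tag{32}$$

and by noting the fact that $1 - H_n \ge H_n(1 - H_n)$ we can plug
(31) into (32) to obtain

$$\mathbb{E}[1 - H_n] \ge \mathbb{E}[1 - H_n \mid H_n \le \alpha 2^{-n\theta}] \Pr(H_n \le \alpha 2^{-n\theta}) + 2^{-n\theta}(\gamma - \alpha). \tag{33}$$

We now continue by using (29) in (33) to obtain

$$\begin{aligned}
\mathbb{E}[1 - H_n] &> (I(W) - \beta 2^{-n\theta})(1 - \alpha 2^{-n\theta}) + 2^{-n\theta}(\gamma - \alpha) \\
&\ge I(W) + 2^{-n\theta}(\gamma - \alpha(1 + I(W)) - \beta),
\end{aligned}$$

and since $2\alpha + \beta = \gamma$, we get $\mathbb{E}[1 - H_n] > I(W)$. However,
this is a contradiction since $H_n$ is a martingale and
$\mathbb{E}[1 - H_n] = I(W)$. ■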
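Lemma 7 can be illustrated on the BEC, where $H_n$ is enumerable and the hypothesis holds with $m = 0$ by Lemma 3: $\gamma = f_0(h) = h(1 - h)$ and $\theta = -\log_2 a_0 = \log_2(4/3)$. The sketch below uses the choice $\alpha = \gamma/4$, $\beta = \gamma/2$ and compares both sides of (28). The function names and the tested erasure probabilities are assumptions for illustration.

```python
import math

def bec_entropies(h, n):
    """All 2**n equally likely values of H_n for BEC(h)."""
    values = [h]
    for _ in range(n):
        values = [u for v in values for u in (1 - (1 - v) ** 2, v * v)]
    return values

def lemma7_check(h, n):
    """Both sides of (28) for BEC(h) with m = 0, gamma = h(1 - h),
    theta = log2(4/3), alpha = gamma / 4 and beta = gamma / 2."""
    gamma, theta = h * (1 - h), -math.log2(0.75)
    alpha, beta = gamma / 4, gamma / 2
    scale = 2 ** (-n * theta)
    values = bec_entropies(h, n)
    prob = sum(v <= alpha * scale for v in values) / len(values)
    bound = (1 - h) - beta * scale          # I(W) = 1 - h for the BEC
    return prob, bound

for n in (4, 8, 12):
    prob, bound = lemma7_check(0.5, n)
    print(f"n = {n:2d}: Pr(H_n <= alpha 2^(-n theta)) = {prob:.4f} <= {bound:.4f}")
```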

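Let us now use the result of Lemma 7 to conclude the proof
of Theorem 1. By Lemma 6, we have for $n \ge m$

$$\mathbb{E}[H_n(1 - H_n)] \ge 2^{-\frac{n - m}{\mu_m}} f_m(H(W)) = 2^{-\frac{n}{\mu_m}} \left( 2^{\frac{m}{\mu_m}} f_m(H(W)) \right).$$

Thus, if we now let

$$\gamma = 2^{\frac{m}{\mu_m}} f_m(H(W)), \qquad 2\alpha = \beta = \frac{\gamma}{2},$$

then by using Lemma 7 we obtain

$$\Pr\left( H_n \le \frac{\gamma}{4} 2^{-\frac{n}{\mu_m}} \right) \le I(W) - \frac{\gamma}{2} 2^{-\frac{n}{\mu_m}}. \tag{34}$$

Now, assume we desire to achieve a rate $R$ equal to

$$R = I(W) - \frac{\gamma}{4} 2^{-\frac{n}{\mu_m}}. \tag{35}$$

Let $\mathcal{I}_{N,R}$ be the set of indices chosen for such a rate $R$, i.e.,
$\mathcal{I}_{N,R}$ includes the $2^n R$ indices of the sub-channels with the
least value of error probability. Define the set $A$ as

$$A = \left\{ i \in \mathcal{I}_{N,R} : H(W_N^{(i)}) \ge \frac{\gamma}{4} 2^{-\frac{n}{\mu_m}} \right\}. \tag{36}$$

In this regard, note that (34) and (35) imply that

$$|A| \ge \frac{\gamma}{4} 2^{n \left( 1 - \frac{1}{\mu_m} \right)}, \tag{37}$$

and as a result, by using (4) and (5) we obtain

$$\sum_{i \in \mathcal{I}_{N,R}} E(W_N^{(i)}) \ge \sum_{i \in A} h_2^{-1}\left( H(W_N^{(i)}) \right) \ge |A| \, h_2^{-1}\left( \frac{\gamma}{4} 2^{-\frac{n}{\mu_m}} \right) \ge \frac{\gamma^2}{16} \cdot \frac{2^{n \left( 1 - \frac{2}{\mu_m} \right)}}{8 \log\left( \frac{4}{\gamma} 2^{\frac{n}{\mu_m}} \right)},$$

where the last step follows from the fact that for $x \in [0, \frac{1}{\sqrt{2}}]$,
we have $h_2^{-1}(x) \ge \frac{x}{8 \log(1/x)}$. Thus, having a block-length $N = 2^n$,
in order to get to block-error probability (measured by
(13)) less than the right side of the last display, the rate can be
at most $R = I(W) - \frac{\gamma}{4} 2^{-\frac{n}{\mu_m}}$.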

Finally, if we let $m = 8$ (by the discussion in Remark 5,
we know that $m = 8$ is suitable), then $\mu_8 = -\frac{1}{\log(a_8)} = 3.553$
and choosing

e = ^ f J.E (38) 

where $R$ is given in (35), then we know for sure that
$\epsilon > 0$ (since $\mu_8 > 2$) and furthermore, to have block-error
probability less than $\epsilon$ the rate should be less than $R$ given in
(35).
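As a consistency check on the numbers used in this proof, the definition (27) can be applied to the values of $a_m$ listed in Table I. The sketch assumes the logarithm in (27) is taken to base 2, which is what reproduces the tabulated values. The function name `mu_from_a` is hypothetical.

```python
import math

def mu_from_a(a):
    """Scaling exponent from (27), with the logarithm taken to base 2."""
    return -1.0 / math.log2(a)

# Values copied from Table I.
table_a = [0.75, 0.7897, 0.8075, 0.8190, 0.8228]
table_mu = [2.409, 2.935, 3.241, 3.471, 3.553]

for a, mu in zip(table_a, table_mu):
    print(f"a_m = {a:.4f}: (27) gives {mu_from_a(a):.3f}, Table I lists {mu}")

# The conjectured limit quoted in Remark 2.
print(f"2^(-1/3.62713) = {2 ** (-1 / 3.62713):.4f}")
```

Small differences in the last digit are expected because the tabulated $a_m$ are themselves given only to four decimal places.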



IV. Open problems 

The results of this paper can be extended in the following 
ways. 

1) In this paper, we take the right side of (13) as a proxy 
for the block error probability and hence our results are with 
respect to the strong reliability condition (14). A significant 
step in this regard would be to prove equivalent bounds for 
the block error probability. 

2) Another way to improve the results of this paper is to
provide better values of the universal parameter $\mu$. Based on
numerical experiments, we conjecture that the value of $\mu$ can
be increased up to the scaling parameter of the channel BEC.
That is, the right value of $\mu$ to plug in (15) is equal to
$\mu = 3.62713$. Thus, the ultimate goal would be to show that for the
channel BEC, the polarization phenomenon takes place faster
than for all the other BMS channels. One way to do this is to
prove that the functions $f_n$ defined in (20) are concave on the
interval $[0, 1]$.

3) The result of Theorem 1 suggests that in terms of finite-length
performance, polar codes are far from optimal. However,
we might get different results if we consider extended
polar codes with $\ell \times \ell$ kernels ([7]). It is not very hard to
prove that at least for the BEC, as $\ell$ grows large, for almost all
the $\ell \times \ell$ kernels the finite-length performance of polar codes
improves towards the optimal one (i.e., $\mu \to 2$). However,
this is at the cost of an increase in complexity proportional
to $2^{\ell}$. This suggests that there might still exist kernels with
reasonable size with superior finite-length properties than the
original $2 \times 2$ kernel. Hence, an interesting open problem is
the finite-length analysis of polar codes that are constructed
from $\ell \times \ell$ kernels and relate such analysis to finding kernels
with better finite-length properties.